Skip to content

Conversation

@rynewang
Copy link

@rynewang rynewang commented Jan 3, 2026

Rationale for this change

Fixes #36889

When writing CSV from a table where the first batch is empty, the header gets written twice:

table = pa.table({"col1": ["a", "b", "c"]})
combined = pa.concat_tables([table.schema.empty_table(), table])
write_csv(combined, buf)
# Result: "col1"\n"col1"\n"a"\n"b"\n"c"\n  <-- header appears twice

What changes are included in this PR?

The bug happens because:

  1. Header is written to data_buffer_ and flushed during CSVWriterImpl initialization
  2. The buffer is not cleared after flush
  3. When the next batch is empty, TranslateMinimalBatch returns early without modifying data_buffer_
  4. The write loop then writes data_buffer_ which still contains stale content

The fix introduces a WriteAndClearBuffer() helper that writes the buffer to sink and clears it. This helper is used in all write paths:

  • WriteHeader()
  • WriteRecordBatch()
  • WriteTable()

This ensures the buffer is always clean after any flush, making it impossible for stale content to be written again.

Are these changes tested?

Yes. Added C++ tests in writer_test.cc and Python tests in test_csv.py:

  • Empty batch at start of table
  • Empty batch in middle of table

Are there any user-facing changes?

No API changes. This is a bug fix that prevents duplicate headers when writing CSV from tables with empty batches.

…ch is empty

When writing CSV, if the first record batch was empty, the header would be
written twice. This happened because:

1. Header is written to data_buffer_ and flushed during initialization
2. TranslateMinimalBatch returns early for empty batches without modifying data_buffer_
3. The loop then writes data_buffer_ which still contains the header

The fix clears the buffer (resize to 0) when encountering an empty batch,
so the subsequent write outputs nothing.

Added C++ and Python tests for empty batches at start and in middle of tables.

Claude-Generated-By: Claude Code (cli/claude-opus-4-5=1%)
Claude-Steers: 2
Claude-Permission-Prompts: 2
Claude-Escapes: 1
@github-actions
Copy link

github-actions bot commented Jan 3, 2026

⚠️ GitHub issue #36889 has been automatically assigned in GitHub to PR creator.

@rynewang
Copy link
Author

rynewang commented Jan 6, 2026

@wgtmac Hi would you mind to also review this? Thanks!

@HuaHuaY
Copy link
Contributor

HuaHuaY commented Jan 9, 2026

I think we can clear data_buffer_ at last line of writeHeader. Relying on the Resize operation in TranslateMinimalBatch to perform implicit clear operation is not very readable.

Addresses review feedback from HuaHuaY: Instead of clearing the buffer
in TranslateMinimalBatch for empty batches, use a WriteAndClearBuffer()
helper that writes and clears the buffer in all write paths.

This is cleaner because:
- Every write follows the same pattern (write -> clear)
- Easier to reason about for future write stages
- The invariant is explicit: buffer is always clean after flush
@rynewang
Copy link
Author

Clearing after writeHeader is a great idea! However it's not complete and failed my unit tests because there are other Write* methods. Now, I made a helper function and let all Write* to do "write to sink and clear buffer". @HuaHuaY can you help pointing this to a maintainer for approval? Thanks!

@HuaHuaY
Copy link
Contributor

HuaHuaY commented Jan 13, 2026

cc @pitrou @wgtmac . Please take a look.

@github-actions
Copy link

⚠️ GitHub issue #36889 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 13, 2026
rynewang and others added 2 commits January 12, 2026 22:28
… tests

- Rename WriteAndClearBuffer to FlushToSink (shorter, clearer intent)
- Consolidate Python tests into a single parameterized test with 3 cases:
  empty batch at beginning, middle, and end
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++][Python] Duplicate csv header when table batches start with empty

3 participants